It was mentioned in the chapter that a cubic regression spline with one knot \(\xi\) can be obtained by using a basis of the form \(x\), \(x^2\), \(x^3\), \((x - \xi)^3_+\) where \((x - \xi)^3_+ = (x - \xi)^3\) if \(x > \xi\) and 0 otherwise.
We will now show that a function of the form \[ f(x) = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \beta_4(x - \xi)^3_+ \]
is indeed a cubic regression spline, regardless of the coefficients \(\beta_j\).
Find a cubic polynomial \[ f_1(x) = a_1 + b_1x + c_1x^2 + d_1x^3 \] such that \(f(x) = f_1(x) \ \forall x \leq \xi\). Express \(a_1, b_1, c_1, d_1\) in terms of \(\beta_j\).
When \(x \leq \xi\), \((x - \xi)^3_+ = 0\). Therefore, \[ \begin{align} f_1(x) &= \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \beta_4(0)\\ f_1(x) &= \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3\\ a_1 + b_1x + c_1x^2 + d_1x^3 &= \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 \end{align} \]
This means \[ [a_1, b_1, c_1, d_1] = [\beta_0, \beta_1, \beta_2, \beta_3] \]
Find a cubic polynomial \[ f_2(x) = a_2 + b_2x + c_2x^2 + d_2x^3 \]
such that \(f(x) = f_2(x) \ \forall x > \xi\). Express \(a_2, b_2, c_2, d_2\) in terms of \(\beta_j\).
When \(x > \xi\), \((x - \xi)^3_+ = (x - \xi)^3\). This means \[ \begin{aligned} f_2(x) &= \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4(x - \xi)^3\\ f_2(x) &= \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \beta_4[x^3 - 3x^2\xi + 3x\xi^2 - \xi^3]\\ f_2(x) &= (\beta_0 - \beta_4\xi^3) + (\beta_1 + 3\beta_4\xi^2)x + (\beta_2 - 3\beta_4\xi)x^2 + (\beta_3 + \beta_4)x^3\\ a_2 + b_2x + c_2x^2 + d_2x^3 &= (\beta_0 - \beta_4\xi^3) + (\beta_1 + 3\beta_4\xi^2)x + (\beta_2 - 3\beta_4\xi)x^2 + (\beta_3 + \beta_4)x^3 \end{aligned} \]
This means \[ [a_2, b_2, c_2, d_2] = [\beta_0 - \beta_4\xi^3, \beta_1 + 3\beta_4\xi^2, \beta_2 - 3\beta_4\xi, \beta_3 + \beta_4] \]
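As a numeric sanity check (a sketch with arbitrarily chosen coefficients and knot, not part of the derivation), we can confirm that the truncated-power form of \(f\) matches the two polynomials on either side of \(\xi\):

```r
# Arbitrary coefficients beta_0 ... beta_4 and knot xi, for illustration only.
b <- c(2, -1, 0.5, 0.3, 1.5)
xi <- 1
f  <- function(x) b[1] + b[2]*x + b[3]*x^2 + b[4]*x^3 + b[5]*pmax(x - xi, 0)^3
f1 <- function(x) b[1] + b[2]*x + b[3]*x^2 + b[4]*x^3
f2 <- function(x) (b[1] - b[5]*xi^3) + (b[2] + 3*b[5]*xi^2)*x +
  (b[3] - 3*b[5]*xi)*x^2 + (b[4] + b[5])*x^3
x <- seq(-2, 3, by = 0.1)
all.equal(f(x), ifelse(x <= xi, f1(x), f2(x)))  # TRUE
```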
We have now established that the cubic spline is a piecewise polynomial.
Show that \(f_1(\xi) = f_2(\xi)\) i.e. \(f(x)\) is continuous at \(\xi\).
We evaluate expressions for \(f_1(\xi)\) and \(f_2(\xi)\).
\[ \begin{align} f_1(\xi) &= \beta_0 + \beta_1\xi + \beta_2\xi^2 + \beta_3\xi^3\\ \text{and}\\ f_2(\xi) &= \beta_0 - \beta_4\xi^3 + (\beta_1 + 3\beta_4\xi^2)(\xi) + (\beta_2 - 3\beta_4\xi)(\xi)^2 + (\beta_3 + \beta_4)(\xi)^3 \\ f_2(\xi) &= \beta_0 - \beta_4\xi^3 + \beta_1\xi + 3\beta_4\xi^3 + \beta_2\xi^2 - 3\beta_4\xi^3 + \beta_3\xi^3 + \beta_4\xi^3\\ f_2(\xi) &= \beta_0 + \beta_1\xi + \beta_2\xi^2 + \beta_3\xi^3 + 3\beta_4\xi^3 - 3\beta_4\xi^3 + \beta_4\xi^3 - \beta_4\xi^3\\ f_2(\xi) &= \beta_0 + \beta_1\xi + \beta_2\xi^2 + \beta_3\xi^3 \end{align} \]
Hence proved.
Show that \(f_1' (\xi) = f_2' (\xi)\) i.e. \(f'(x)\) is continuous at \(\xi\).
For function \(f_1(x)\) \[ \begin{align} f_1(x) &= \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3\\ f_1'(x) &= \beta_1 + 2\beta_2x + 3\beta_3x^2\\ f_1'(\xi) &= \beta_1 + 2\beta_2\xi + 3\beta_3\xi^2 \end{align} \]
Likewise, for \(f_2(x)\) \[ \begin{align} f_2(x) &= (\beta_0 - \beta_4\xi^3) + (\beta_1 + 3\beta_4\xi^2)x + (\beta_2 - 3\beta_4\xi)x^2 + (\beta_3 + \beta_4)x^3\\ f_2'(x) &= (\beta_1 + 3\beta_4\xi^2) + 2(\beta_2 - 3\beta_4\xi)x + 3(\beta_3 + \beta_4)x^2\\ f_2'(\xi) &= \beta_1 + 3\beta_4\xi^2 + 2(\beta_2 - 3\beta_4\xi)\xi + 3(\beta_3 + \beta_4)\xi^2\\ f_2'(\xi) &= \beta_1 + 3\beta_4\xi^2 + 2\beta_2\xi - 6\beta_4\xi^2 + 3\beta_3\xi^2 + 3\beta_4\xi^2\\ f_2'(\xi) &= \beta_1 + 2\beta_2\xi + 3\beta_3\xi^2 + 3\beta_4\xi^2 + 3\beta_4\xi^2 - 6\beta_4\xi^2\\ f_2'(\xi) &= \beta_1 + 2\beta_2\xi + 3\beta_3\xi^2 \end{align} \]
Hence proved.
Show that \(f_1'' (\xi) = f_2'' (\xi)\) i.e. \(f''(x)\) is continuous at \(\xi\).
For function \(f_1(x)\) \[ \begin{align} f_1'(x) &= \beta_1 + 2\beta_2x + 3\beta_3x^2\\ f_1''(x) &= 2\beta_2 + 6\beta_3x\\ f_1''(\xi) &= 2\beta_2 + 6\beta_3\xi \end{align} \]
Likewise, for \(f_2(x)\) \[ \begin{align} f_2'(x) &= (\beta_1 + 3\beta_4\xi^2) + 2(\beta_2 - 3\beta_4\xi)x + 3(\beta_3 + \beta_4)x^2\\ f_2''(x) &= 2(\beta_2 - 3\beta_4\xi) + 6(\beta_3 + \beta_4)x\\ f_2''(\xi) &= 2\beta_2 - 6\beta_4\xi + 6\beta_3\xi + 6\beta_4\xi\\ f_2''(\xi) &= 2\beta_2 + 6\beta_3\xi - 6\beta_4\xi + 6\beta_4\xi\\ f_2''(\xi) &= 2\beta_2 + 6\beta_3\xi \end{align} \]
Hence proved.
Through this derivation, we have established that:

- the cubic regression spline is a piecewise polynomial which is continuous at \(\xi\);
- the spline’s first derivative is continuous at \(\xi\);
- the spline’s second derivative is also continuous at \(\xi\);
- the continuity of the second derivative means that \(f(x)\) is also smooth at the knot \(\xi\).
Suppose that a curve \(\hat{g}\) is computed to smoothly fit a set of points using the following formula. \[ \hat{g} = \text{arg} \ \underset{g}{min}\bigl(\sum\limits_{i = 1}^n(y_i - g(x_i))^2 + \lambda \int \bigl[g^{(m)}(x)\bigr]^2dx\bigr) \]
where \(g^{(m)}\) represents the \(m\)th derivative of \(g\) (and \(g^{(0)} = g\)). Provide example sketches of \(\hat{g}\) in each of the following scenarios.
For this exercise, I’ll use the methodology outlined by Liam Morgan in his solutions.
\(g\) is a data generating function, which I’ll arbitrarily assume to be
\[ \begin{align} g(X) &= X^2 + \sin(10(X + 10))\\ \Rightarrow Y &= g(X) + \epsilon\\ \end{align} \]
such that \[ X \sim \mathcal{U}[0, 1]\\ \epsilon \sim \mathcal{N}(0, 1) \]
Implementing this data generating process with code.
set.seed(1)
X <- runif(50)
eps <- rnorm(n = 50, mean = 0, sd = 1)
Y <- X^2 + sin(10 * (X + 10)) + eps
gen.fn <- function(x){x^2 + sin(10 * (x + 10))}
# Test with a plot
ggplot(data.frame(X, Y), aes(x = X, y = Y)) +
geom_point() +
stat_function(fun = gen.fn, aes(col = 'Generating Function')) +
theme(legend.position = 'bottom')
The intuition here is that we want to find the function \(g\) that minimizes the residual sum of squares subject to a smoothing penalty.
As \(\lambda \rightarrow \infty\), the penalty term dominates the criterion, which means the smoothing spline is incentivized to choose a form \(\hat{g}\) whose \(m\)th derivative is as close to zero as possible.
As \(\lambda \rightarrow 0\), the penalty term is eliminated, which means \(\hat{g}\) can be as variable as required to fully interpolate the data; there is no penalty associated with any variability of \(g\).
\(m\) controls the order of the derivative that is penalised: the \(\lambda\) penalty incentivizes the choice of a \(\hat{g}\) whose \(m\)th derivative is small in magnitude.
\(\lambda = \infty, m = 0\)
With \(m = 0\), the equation becomes
\[ \hat{g} \approx \text{arg} \ \underset{g}{min}\bigl(\lambda \int \bigl[g^{(0)}(x)\bigr]^2dx \bigr) \]
This means the smoothing penalty is applied to \(g^{(0)} = g\) itself. As \(\lambda \rightarrow \infty\), the term \(\lambda \int \bigl[g^{(0)}(x)\bigr]^2dx\) dominates, i.e. the penalty applies to the integral of the squared function \(g\). This essentially forces \(g \rightarrow 0\), so the selected form will be \(\hat{g} = 0\), which is the function that minimises the integral of \(g^2\).
my.df <- data.frame(
x = X,
y = Y,
y_pred = 0
)
ggplot(my.df, aes(x = x)) +
geom_point(aes(y = y)) +
geom_smooth(aes(y = y_pred, col = 'Chosen Function')) +
stat_function(fun = gen.fn, aes(col = 'Generating Function'))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
\(\lambda = \infty, m = 1\)
The point about the penalty term dominating the fit still applies. However, this time the penalty is applied to \[ \hat{g} \approx \text{arg} \ \underset{g}{min}\bigl(\lambda \int \bigl[g^{(1)}(x)\bigr]^2dx \bigr) \]
Which means the penalty is applied to the slope of \(g\), forcing \(g^{(1)} \rightarrow 0\). This happens when \(g(x)\) is a horizontal line. Technically, any constant function \(g(x) = k\) would suffice to minimize the penalty term, but we also need to implicitly minimise the RSS term, which is done by choosing the constant that best fits the data, i.e. the sample mean
\[ g(x) = \bar{y} = \frac{1}{N} \sum_{i = 1}^{N} y_i \]
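A quick numeric sanity check (a sketch, using the simulated `Y` from the code above): among constant fits \(g(x) = k\), a grid search over \(k\) recovers the sample mean as the RSS minimiser.

```r
# Search over candidate constants k and find the one minimising the RSS.
rss <- function(k) sum((Y - k)^2)
k.grid <- seq(min(Y), max(Y), length.out = 10000)
best.k <- k.grid[which.min(sapply(k.grid, rss))]
c(best.k, mean(Y))  # the two values agree up to grid precision
```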
my.df <- data.frame(x = X, y = Y, y_pred = mean(Y))
my.df %>%
ggplot(aes(x = x)) +
geom_point(aes(y = y)) +
geom_line(aes(y = y_pred, col = 'Chosen function')) +
stat_function(fun = gen.fn, aes(col = 'Generating Function'))
\(\lambda = \infty, m = 2\)
Applying a similar logic to this condition, the equation becomes dominated by the penalty term and the term being minimised becomes
\[ \hat{g} \approx \text{arg} \ \underset{g}{min}\bigl(\lambda \int \bigl[g^{(2)}(x)\bigr]^2dx \bigr) \] Which is the same as minimising a function \(g\) such that its second derivative is zero i.e. the slope does not change with \(x\). This can only occur for a straight line, which can be approximated by the linear model because it will also minimize the RSS.
my.df <- data.frame(
x = X, y = Y,
y_pred = fitted(lm(Y ~ X))
)
my.df %>%
ggplot(aes(x = X)) +
geom_point(aes(y = y)) +
geom_line(aes(y = y_pred, col = 'Chosen Function')) +
stat_function(fun = gen.fn, aes(col = 'Generating Function'))
\(\lambda = \infty, m = 3\)
When \(m = 3\), the term being optimized or minimized is
\[ \hat{g} \approx \text{arg} \ \underset{g}{min}\bigl(\lambda \int \bigl[g^{(3)}(x)\bigr]^2dx \bigr) \] The integrand is minimised when the third derivative of \(g(x)\) is zero, i.e. when the second derivative is a constant. This requirement is satisfied by a quadratic of the form \(ax^2 + bx + c\). To choose a specific quadratic, we minimise the RSS by least squares.
my.df <- data.frame(
x = X, y = Y,
y_pred = fitted(lm(Y ~ poly(X, 2)))
)
my.df %>%
ggplot(aes(x = X)) +
geom_point(aes(y = Y)) +
geom_line(aes(y = y_pred, col = 'Chosen Function')) +
stat_function(fun = gen.fn, aes(col = 'Generating Function'))
\(\lambda = 0, m = 3\)
When \(\lambda = 0\), the smoothing spline equation simplifies to \[ \hat{g} = \text{arg} \ \underset{g}{min}\sum\limits_{i = 1}^n(y_i - g(x_i))^2 \] This means there is no constraint on the regression spline to be smooth. It can be as variable as required to fully interpolate the data and minimise training RSS.
my.spline.fit <- smooth.spline(x = X, y = Y, all.knots = TRUE, lambda = 1e-15)
my.spline.pred <- predict(
my.spline.fit, x = seq(min(X) - 0.02, max(X) + 0.02, by = 0.0001)
)
data.frame(x = X, y = Y) %>% ggplot(aes(x = x, y = y)) +
geom_point() +
stat_function(fun = gen.fn, aes(col = 'Generating Function')) +
geom_line(data = data.frame(x = my.spline.pred$x, y = my.spline.pred$y),
aes(x = x, y = y, col = 'Chosen Function'))
Using LOOCV, we can get a much better regression spline that uses the optimal value of \(\lambda\).
my.spline.loocv.fit <- smooth.spline(x = X, y = Y, all.knots = T)
my.spline.loocv.pred <- predict(
my.spline.loocv.fit, x = seq(min(X) - 0.02, max(X) + 0.02, by = 0.001)
)
data.frame(x = X, y = Y) %>% ggplot(aes(x = x, y = y)) +
geom_point() +
stat_function(fun = gen.fn, aes(col = 'Generating Function')) +
geom_line(data = data.frame(
x = my.spline.loocv.pred$x, y = my.spline.loocv.pred$y),
aes(x = x, y = y, col = 'Chosen Function'))
With LOOCV to choose \(\lambda\) and a knot at each data point, we have managed to generate a smoothing spline that is very similar to the actual generating function, despite the noise in the data.
Suppose we fit a curve with the basis functions \(b_1(X) = X, b_2(X) = (X - 1)^2I(X \geq 1)\). We fit the linear regression model \[ Y = \beta_0 + \beta_1b_1(X) + \beta_2b_2(X) + \epsilon \]
and obtain the coefficient estimates \(\hat\beta_0 = 1, \ \hat{\beta}_1 = 1, \ \hat{\beta}_2 = -2\). Sketch the estimated curve between \(X = -2\) and \(X = 2\). Note the intercepts, slopes, and other relevant information.
After substituting the basis functions, the curve becomes \[ Y = \beta_0 + \beta_1(X) + \beta_2(X-1)^2I(X \geq 1) \]
For \(X < 1\), this means the equation is just \[ \begin{align} Y_1 &= \beta_0 + \beta_1(X)\\ Y_1 &= 1 + X \end{align} \]
And for \(X \geq 1\) it assumes the full form \[ \begin{align} Y_2 &= \beta_0 + \beta_1(X) + \beta_2(X - 1)^2\\ Y_2 &= \beta_0 + \beta_1(X) + \beta_2(X^2 - 2X + 1)\\ Y_2 &= \beta_0 + \beta_1X + \beta_2X^2 - 2\beta_2X + \beta_2\\ Y_2 &= \beta_0 + \beta_2 + (\beta_1 - 2\beta_2)X + \beta_2X^2\\ Y_2 &= 1 - 2 + (1 - 2(-2))X + (-2)X^2\\ Y_2 &= -1 + 5X - 2X^2 \end{align} \]
Based on this information, we expect a piecewise function that is a straight line with intercept = \(\beta_0 = 1\) and slope \(\beta_1 = 1\) prior to \(X = 1\) and a quadratic curve after \(X = 1\) with an offset of \(\beta_0 + \beta_2\).
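As a consistency check, the two pieces agree at the knot \(X = 1\):

\[ Y_1(1) = 1 + 1 = 2, \qquad Y_2(1) = -1 + 5(1) - 2(1)^2 = 2 \]

so the sketched curve is continuous at the knot, as expected from the basis function \((X - 1)^2I(X \geq 1)\), which vanishes at \(X = 1\).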
x <- seq(-2, 2, by = 0.01)
y <- sapply(x, function(x){
if (x < 1) { 1 + x } else { -1 + 5* x - 2 *x^2}
})
ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) +
geom_line() +
geom_vline(xintercept = 1, color = 'red')
This choice of basis functions results in a fit that is continuous at the cutover point \(c_1 = 1\).
Suppose we fit a curve with the basis functions \[ b_1(X) = I(0 \leq X \leq 2) - (X - 1)I(1 \leq X \leq 2)\\ b_2(X) = (X - 3)I(3 \leq X \leq 4) + I(4 < X \leq 5) \]
We fit the linear regression model \[ Y = \beta_0 + \beta_1b_1(X) + \beta_2b_2(X) + \epsilon \]
and obtain coefficient estimates \(\hat{\beta_0} = 1, \hat{\beta_1} = 1, \hat{\beta_2} = 3\).
Sketch the estimated curve between \(X = -2\) and \(X = 6\). Note the intercepts, slopes, and other relevant information.
The first step is to substitute the basis functions in the linear model for \(X \leq 2\) \[ \hat{y} = \beta_0 + \beta_1[I(0 \leq X \leq 2) - (X - 1)I(1 \leq X \leq 2)] + \beta_2[(X - 3)I(3 \leq X \leq 4) + I(4 < X \leq 5)]\\ \text{For X} \leq 2\\ \hat{y} = \beta_0 + \beta_1[I(0 \leq X \leq 2) - (X - 1)I(1 \leq X \leq 2)]\\ \hat{y} = 1 + 1[I(0 \leq X \leq 2) - (X - 1)I(1 \leq X \leq 2)]\\ \]
For all \(X < 0\), neither indicator variable is active \(\Rightarrow \hat{y} = \beta_0 = 1\)
For all \(X \in [0, 1)\), only the first indicator variable is active. This makes the equation \[ \hat{y} = 1 + 1(1 - 0) = 2 \]
For all \(X \in [1, 2]\), both indicator variables are true. This makes the equation \[ \hat{y} = 1 + 1[1 - (X - 1)] = 1 + 1 - X + 1 = 3 - X \]
We do a similar analysis for \(X > 2\).
None of the indicator variables are going to be true between 2 and 3, so \(\hat{y} = \beta_0 = 1\).
For \(X \in [3, 4]\), \(\hat{y} = \beta_0 + \beta_2(X - 3) = 1 + 3(X - 3) = 3X - 8\)
For \(X \in [4, 5]\), \(\hat{y} = \beta_0 + \beta_2[1] = 1 + 3 = 4\)
For \(X \in (5, 6]\), none of the indicator variables are active, so \(\hat{y} = \beta_0 = 1\).
Generating the plot based on these rules
x <- seq(-2, 6, by = 0.01)
y <- sapply(x, function(x){
dplyr::case_when(
x < 0 ~ 1,
x >= 0 & x < 1 ~ 2,
x >= 1 & x <= 2 ~ 3 - x,
x > 2 & x < 3 ~ 1,
x >= 3 & x < 4 ~ 3 * x - 8,
x >= 4 & x <= 5 ~ 4,
TRUE ~ 1
)
})
ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) +
geom_line() +
geom_vline(xintercept = c(0, 1, 2, 3, 4, 5, 6), linetype = 'dotted', color = 'red')
The fit is piecewise linear, with jumps at \(X = 0\) and \(X = 5\) where the indicator functions switch on and off; between those points the segments join continuously.
Consider two curves \(\hat{g_1}\) and \(\hat{g_2}\) defined by
\[ \begin{align} \hat{g_1} &= \text{arg} \underset{g}{min} \left( \sum_{i = 1}^N(y_i - g(x_i))^2 + \lambda\int[g^{(3)}(x)]^2dx\right)\\ \hat{g_2} &= \text{arg} \underset{g}{min} \left( \sum_{i = 1}^N(y_i - g(x_i))^2 + \lambda\int[g^{(4)}(x)]^2dx\right)\\ \end{align} \] Where \(g^{(m)}\) represents the \(m^{th}\) derivative of \(g\).
The only difference between these two curves is that the penalty term integrates a different order of derivative. For \(\hat{g_1}\) this is the 3rd-order derivative, so the penalty forces \(g\) toward a form whose 3rd derivative is zero, i.e. whose 2nd derivative is constant. This means the form of equation favoured by \(\hat{g_1}\) will be roughly quadratic.
By the same logic, \(\hat{g_2}\) favours a cubic polynomial, because the 4th derivative of a cubic polynomial is 0.
The extent to which the chosen form of \(\hat{g_1}\) or \(\hat{g_2}\) will follow these quadratic or cubic ideals depends on the magnitude of the \(\lambda\) parameter.
As \(\lambda \rightarrow \infty\), will \(\hat{g_1}\) or \(\hat{g_2}\) have the smaller training RSS?
As \(\lambda \rightarrow \infty\), the penalty dominates, so \(\hat{g_1}\) and \(\hat{g_2}\) become quadratic and cubic polynomials respectively. Of the two, the cubic polynomial of \(\hat{g_2}\) is more flexible, and will be able to fit the training data better, resulting in a lower training RSS.
As \(\lambda \rightarrow \infty\), will \(\hat{g_1}\) or \(\hat{g_2}\) have the smaller test RSS?
The effect of the \(\lambda\) penalty is still prevalent as discussed in part (b). However, a priori it is difficult to say which of the two curves will have a smaller test RSS.
If the true relationship in the data is close to linear or quadratic, then the quadratic form of \(\hat{g_1}\) will have lower test RSS. If the true relationship is highly non-linear then the cubic form of \(\hat{g_2}\) will have lower test RSS.
As a general rule, though, we expect test RSS for the quadratic form of \(\hat{g_1}\) to be lower because it is less flexible and thus has lower variance than the cubic \(\hat{g_2}\).
As \(\lambda \rightarrow 0\), will \(\hat{g_1}\) or \(\hat{g_2}\) have the smaller training and test RSS?
With the penalty term becoming 0, there is no effect of the variability of the slope on the form of the function chosen. The only requirement is to minimise the training RSS. In this case, both curves will take the same form and try to interpolate as much of the training data as possible. Both will have similar training RSS and test RSS, although this test RSS will be extremely high. Assumption here is that the same form of \(g\) is used in both cases.
Wage

In this exercise, you will further analyse the Wage data set considered throughout the chapter.
Perform polynomial regression to predict the wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen? How does it compare to the results of the hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
# Seed random number generator
set.seed(163)
# Make a vector to store cross-validated MSE
results.list <- c()
# Use 5-fold cross-validation with polynomial degrees 1 to 15
k.folds <- 5
degrees <- seq(1, 15)
# Iterate over the candidate polynomial degrees
for (d in degrees) {
# Fit a GLM with the age raised to degree `d` features
poly.glm <- glm(formula = wage ~ poly(age, degree = d, raw = TRUE), data = Wage)
cv.err <- cv.glm(data = Wage, glmfit = poly.glm, K = k.folds)
# Append the cross-validation error to the list
results.list <- c(results.list, cv.err$delta[1])
}
plot(degrees, results.list, type = 'b', xlab = 'Polynomial Degree',
ylab = 'Cross Validated MSE', main = 'Wage - CV Error with Polynomial Degree')
points(
which.min(results.list),
min(results.list),
cex = 2,
pch = 20,
col = 'red'
)
Based on cross-validation results, a 10th degree polynomial seems to give the lowest MSE.
Testing the same approach with ANOVA.
anova(
lm(wage ~ poly(age, 1), data = Wage),
lm(wage ~ poly(age, 2), data = Wage),
lm(wage ~ poly(age, 3), data = Wage),
lm(wage ~ poly(age, 4), data = Wage),
lm(wage ~ poly(age, 5), data = Wage),
lm(wage ~ poly(age, 6), data = Wage),
lm(wage ~ poly(age, 7), data = Wage),
lm(wage ~ poly(age, 8), data = Wage),
lm(wage ~ poly(age, 9), data = Wage),
lm(wage ~ poly(age, 10), data = Wage),
lm(wage ~ poly(age, 11), data = Wage),
lm(wage ~ poly(age, 12), data = Wage),
lm(wage ~ poly(age, 13), data = Wage),
lm(wage ~ poly(age, 14), data = Wage),
lm(wage ~ poly(age, 15), data = Wage)
)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
## Model 7: wage ~ poly(age, 7)
## Model 8: wage ~ poly(age, 8)
## Model 9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
## Model 11: wage ~ poly(age, 11)
## Model 12: wage ~ poly(age, 12)
## Model 13: wage ~ poly(age, 13)
## Model 14: wage ~ poly(age, 14)
## Model 15: wage ~ poly(age, 15)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5637 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8867 0.001681 **
## 4 2995 4771604 1 6070 3.8090 0.051070 .
## 5 2994 4770322 1 1283 0.8048 0.369731
## 6 2993 4766389 1 3932 2.4675 0.116329
## 7 2992 4763834 1 2555 1.6034 0.205515
## 8 2991 4763707 1 127 0.0795 0.778016
## 9 2990 4756703 1 7004 4.3952 0.036124 *
## 10 2989 4756701 1 3 0.0017 0.967552
## 11 2988 4756597 1 103 0.0648 0.799144
## 12 2987 4756591 1 7 0.0043 0.947923
## 13 2986 4756401 1 190 0.1189 0.730224
## 14 2985 4756158 1 243 0.1522 0.696488
## 15 2984 4755364 1 795 0.4986 0.480151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis for an ANOVA is that a model \(\mathcal{M_1}\) is sufficient to explain the relationship between predictor and response compared to a more complex model \(\mathcal{M_2}\). The \(p\)-value quantifies the evidence against this null hypothesis: for row \(i\), it tests whether \(\mathcal{M_{i-1}}\) is sufficient compared to \(\mathcal{M_{i}}\).
We can see that the \(p\)-values comparing models up to degree 3 are highly significant. The \(p\)-value for the hypothesis that we should prefer a degree-4 model over a degree-3 model is just above the 5% significance threshold.
After this degree, the \(p\)-values are consistently high until we reach the comparison that favours the 9-degree polynomial over the 8-degree polynomial, which is below the 5% threshold for significance.
Immediately afterwards, the \(p\)-value comparing the 9-degree and 10-degree polynomials is very large (0.97), so there is no evidence for going beyond degree 9.
In conclusion: CV suggests we should use a degree-10 polynomial, although degree 4 also has a very similar CV MSE.
To minimise the risk of overfitting, we should ideally use the degree-4 polynomial.
The ANOVA test also supports this choice, because the gains from additional polynomial terms largely taper off after degree 4.
Fitting a degree 4 and degree 10 polynomial to the data to assess fit.
ggplot(Wage, aes(x = age, y = wage)) +
geom_point(alpha = 0.3) +
geom_smooth(method = 'lm', formula = "y ~ poly(x, degree = 4, raw = T)", aes(col = 'degree 4')) +
geom_smooth(method = 'lm', formula = "y ~ poly(x, degree = 10, raw = T)", aes(col = 'degree 10'))
Both fits are very similar, but the 10-degree polynomial seems to overfit at the extreme values of age.
Fit a step function to predict wage using age and perform cross validation to choose the number of cuts. Make a plot of the fit obtained.
We repeat the same process as above, but this time use the cut() function to bin age into different intervals.
set.seed(163)
cut.results <- c()
cut.intervals <- seq(2, 20)
k.folds <- 5
for (c in cut.intervals) {
# We can't use the conventional process of fitting a linear model with cut
# and then passing this as an argument to cv.glm, because cv.glm will
# implicitly call cut again for each training fold, resulting in different
# levels. Solution is to initialize "global" levels ahead of splitting and
# use them as predictors.
# See here: https://stats.stackexchange.com/questions/472385/why-does-the-glm-function-in-r-give-an-error-message-when-trying-to-fit-a-step
Wage$temp.age.levels <- cut(Wage$age, c)
glm.interval.fit <- glm(wage ~ temp.age.levels, data = Wage)
cv.err <- cv.glm(data = Wage, glmfit = glm.interval.fit, K = k.folds)
cut.results <- c(cut.results, cv.err$delta[1])
}
# Remember to remove the temporary variable
Wage$temp.age.levels <- NULL
plot(cut.intervals, cut.results, type = 'b',
xlab = 'Number of Intervals',
ylab = 'Cross-validated MSE',
main = 'Wage - Cross-validated MSE against Intervals used to bin age')
points(cut.intervals[which.min(cut.results)], min(cut.results), pch = 20, col = 'red', cex = 2)
Cross-validation suggests that 15 intervals will result in the lowest out-of-sample MSE. However, we can see that we get very similar MSE at 10 intervals and again at 12 intervals. Following the rule of thumb that simpler models are preferable, 10 intervals looks like a better choice.
Fitting both to the data and checking fit.
ggplot(Wage, aes(x = age, y = wage)) +
geom_point(alpha = 0.5) +
geom_smooth(method = 'lm', formula = 'y ~ cut(x, 10)', aes(col = '10 intervals')) +
geom_smooth(method = 'lm', formula = 'y ~ cut(x, 15)', aes(col = '15 intervals')) +
labs(x = 'Age', y = 'Wage', title = 'Predicted Wage against Age for Different Binned versions of Age')
Wage with new Features

The Wage data set contains a number of other features not explored in this chapter, such as marital status (maritl), job class (jobclass), and others. Explore the relationships between some of these predictors and wage, and use the non-linear fitting techniques in order to fit flexible models to the data. Create plots of the results obtained, and write a summary of your findings.
First, use a pairplot to get a high level overview of the relationships in the data.
pairs(Wage)
Too many features for us to make an informed decision. Need to look at information feature by feature.
# Make a list of features
wage.features.cat <- colnames(Wage)[!(colnames(Wage) %in% c('wage', 'logwage', 'age', 'year'))]
wage.features.num <- c('age', 'year')
# Boxplots and pie charts for the categorical features
for (feature in wage.features.cat) {
baseline.plot <- ggplot(Wage, aes(x = get(feature), group = get(feature), y = wage)) +
geom_boxplot() + labs(x = feature, title = paste0("Feature Profile - ", feature))
prop.plot <- ggplot(
Wage %>% dplyr::count(feature_cat = factor(get(feature))) %>%
dplyr::mutate(pct = n / sum(n) * 100),
aes(x = '', y = n, fill = feature_cat)) +
geom_bar(stat = 'identity', width = 1) +
coord_polar("y", start = 0)
print(cowplot::plot_grid(baseline.plot, prop.plot, ncol = 2))
}
# Scatterplots for numeric features
for (feature in wage.features.num) {
baseline.plot <- ggplot(Wage, aes(x = get(feature), y = wage)) + geom_point() +
labs(x = feature, title = paste0("Feature Profile - ", feature))
print(baseline.plot)
}
Based on these plots, the following feature engineering decisions might be useful:

1. maritl: combine separated and divorced into a single category, because divorced has low volume but a similar distribution over the target.
2. race: similar treatment for the black and other categories.
3. age: could use 10 bins instead of a continuous value, or use a polynomial transformation, because it gave good RSS with earlier models.
new.wage <- Wage %>%
dplyr::mutate(
maritl = ifelse(maritl %in% c('4. Divorced', '5. Separated'), '6. Div_or_Sep', as.character(maritl)) %>% as.factor(),
race = ifelse(race %in% c('4. Other', '2. Black'), '5. black other', as.character(race)) %>% as.factor(),
)
new.wage.model.01 <- glm(
wage ~ poly(age, 4, raw = T) + year + maritl + race + education + jobclass + health + health_ins,
data = new.wage
)
# Same as before but with higher degree polynomial terms for age and some interaction terms
new.wage.model.02 <- glm(
wage ~ poly(age, 10, raw = T) + year + maritl + race + education + jobclass + health + health_ins +
maritl:health_ins + education:jobclass,
data = new.wage
)
new.wage.cv.01 <- cv.glm(data = new.wage, glmfit = new.wage.model.01, K = 5)
new.wage.cv.02 <- cv.glm(data = new.wage, glmfit = new.wage.model.02, K = 5)
new.wage.cv.01$delta
## [1] 1147.608 1145.985
new.wage.cv.02$delta
## [1] 1148.027 1144.918
anova(new.wage.model.01, new.wage.model.02)
## Analysis of Deviance Table
##
## Model 1: wage ~ poly(age, 4, raw = T) + year + maritl + race + education +
## jobclass + health + health_ins
## Model 2: wage ~ poly(age, 10, raw = T) + year + maritl + race + education +
## jobclass + health + health_ins + maritl:health_ins + education:jobclass
## Resid. Df Resid. Dev Df Deviance
## 1 2982 3399010
## 2 2969 3363795 13 35215
Skipping to question 9 because I don’t have time.
Boston Data Set

This question uses the variables dis (the weighted mean of distances to five Boston employment centres) and nox (nitrogen oxide concentration in parts per 10 million) from the Boston data. We will treat dis as the predictor and nox as the response.
Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the resulting data and polynomial fits.
# Train a model and report the cross-validated error
boston.cubic.fit <- glm(nox ~ poly(dis, degree = 3, raw = TRUE), data = Boston)
boston.cubic.cv <- cv.glm(data = Boston, glmfit = boston.cubic.fit, K = 10)
boston.cubic.cv$delta
## [1] 0.003857155 0.003855211
# What is the statistical significance of each term and training set performance
summary(boston.cubic.fit)
##
## Call:
## glm(formula = nox ~ poly(dis, degree = 3, raw = TRUE), data = Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9341281 0.0207076 45.110 < 2e-16 ***
## poly(dis, degree = 3, raw = TRUE)1 -0.1820817 0.0146973 -12.389 < 2e-16 ***
## poly(dis, degree = 3, raw = TRUE)2 0.0219277 0.0029329 7.476 3.43e-13 ***
## poly(dis, degree = 3, raw = TRUE)3 -0.0008850 0.0001727 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.003852802)
##
## Null deviance: 6.7810 on 505 degrees of freedom
## Residual deviance: 1.9341 on 502 degrees of freedom
## AIC: -1370.9
##
## Number of Fisher Scoring iterations: 2
# Plot predictions
my.df <- data.frame(
x = Boston$dis,
y = Boston$nox,
y_pred = predict(boston.cubic.fit, newdata = Boston)
)
my.df %>% ggplot(aes(x = x)) +
geom_point(aes(y = y, col = 'Original Data')) +
geom_point(aes(y = y_pred, col = 'Degree 3 Polynomial')) +
geom_smooth(aes(y = y_pred, col = 'Degree 3 Polynomial'))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The regression output shows that all terms are statistically significant, and that the third degree polynomial follows the general trend of the data quite well.
Plot the polynomial fits for a range of different polynomial degrees (from 1 to 10) and report the associated residual sum of squares.
model.rss <- c()
for (d in 1:10) {
my.model <- lm(nox ~ poly(dis, degree = d, raw = TRUE), data = Boston)
my.model.rss <- sum(my.model$residuals^2)
model.rss <- c(model.rss, my.model.rss)
}
plot(1:10, model.rss, type = 'b', xlab = 'Polynomial Degree', ylab = 'Training RSS',
main = 'Nox vs Poly(Dis, d) RSS')
Perform cross validation or another approach to select the optimal degree for the polynomial and explain your results.
my.model.results <- c()
for (d in 1:10) {
my.model <- glm(nox ~ poly(dis, degree = d, raw = TRUE), data = Boston)
my.model.cv <- cv.glm(data = Boston, glmfit = my.model, K = 5)
my.model.cv.mse <- my.model.cv$delta[1]
my.model.results <- c(my.model.results, my.model.cv.mse)
}
plot(1:10, my.model.results, type = 'b', xlab = 'Polynomial Degree',
ylab = 'Cross-Validated MSE', main = 'Nox vs Poly(Dis, d) X-val RSS')
points(which.min(my.model.results), min(my.model.results), col = 'red', pch = 20, cex = 2)
The training RSS decreases monotonically with increasing polynomial degree, because a higher polynomial degree allows the polynomial regressor to practically interpolate the data and minimise training RSS. However, such models have very high variance and overfit the training data, and don’t generalize well to validation data.
This is why the cross-validated MSE shows a degree 3 polynomial has the smallest out-of-sample MSE, and suggests that we should use a degree 3 polynomial to fit the data.
Use the bs() function to fit a regression spline to predict nox using dis. Report the output for the fit using four degrees of freedom. How did you choose the knots? Plot the resulting fit.
The default bs() fit is a cubic spline; with \(K\) internal knots it has \(K + 4\) degrees of freedom, counting the intercept. Since bs() drops the intercept column, asking for df = 4 leaves room for \(4 - 3 = 1\) internal knot, which bs() places at a quantile of the data (by default, the median of dis). So rather than choosing a knot location ourselves, we let bs() place the single knot at the 50th percentile.
ggplot(Boston, aes(x = dis, y = nox)) +
geom_point() +
geom_smooth(method = "lm", formula = "y ~ bs(x, df = 4)", aes(col = "Cubic B-spline (df = 4)")) +
labs(x = 'dis', y = 'nox', title = 'Regression Spline Fit via bs()')
Now fit a regression spline for a range of degrees of freedom, plot the resulting fits, and report the resulting RSS. Describe the results obtained.
my.model.results <- list()
for (i in 3:15) {
my.model <- lm(nox ~ bs(dis, df = i), data = Boston)
my.model.results[[paste0('df_', sprintf("%02d", i))]] <- my.model
}
training.rss <- sapply(my.model.results, function(x){ sum(x[['residuals']]^2) })
my.df <- data.frame(
model = names(my.model.results),
train_rss = unname(training.rss)
)
my.df %>% ggplot(aes(x = model, y = train_rss, group = 'train_rss')) +
geom_point() + geom_line() +
labs(x = 'Model', y = 'Training RSS', title = 'Training RSS by Degrees of Freedom in a Basis Spline')
As expected, the training RSS generally decreases as the degrees of freedom increase, because the model becomes more flexible and better able to fit the training data. However, the decrease is not monotonic: the training RSS actually increases between 8 and 9 degrees of freedom and again from 10 to 11. This is possible because splines with different quantile-placed knots are not nested models, so adding a degree of freedom does not guarantee a lower training RSS.
Perform cross-validation or another approach in order to select the best degrees of freedom for a regression spline on this data. Describe your results.
my.model.results <- c()
degrees.of.freedom <- 3:15
for (i in degrees.of.freedom) {
my.model <- glm(nox ~ bs(dis, df = i), data = Boston)
my.model.cv.results <- cv.glm(data = Boston, glmfit = my.model, K = 5)
my.model.results <- c(my.model.results, my.model.cv.results$delta[1])
}
## Warning in bs(dis, degree = 3L, knots = numeric(0), Boundary.knots = c(1.1296, :
## some 'x' values beyond boundary knots may cause ill-conditioned bases
## (The same warning repeats for every degrees-of-freedom setting and CV fold:
## under cross-validation, held-out dis values can fall outside the boundary
## knots computed on the training folds.)
plot(x = degrees.of.freedom, my.model.results, xlab = 'Degrees of Freedom',
ylab = 'X-Val MSE',
main = 'X-Val MSE for Basis Splines of Different Degrees of Freedom',
type = 'b')
points(degrees.of.freedom[which.min(my.model.results)], min(my.model.results),
col = 'red', pch = 20, cex = 2)
The cross-validated MSE suggests that the optimal degrees of freedom are around 12. However, based on the 1-SE rule for cross-validated errors, we might instead choose the model with 5 degrees of freedom: its error is of comparable magnitude, and the simpler model is less prone to overfitting.
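cv.glm() reports only the mean CV error, so applying the 1-SE rule requires standard errors as well. A sketch of the selection step, with illustrative numbers standing in for real fold-level results (the helper one.se.choice and all values below are hypothetical):

```r
# 1-SE rule: pick the simplest model whose CV error is within one standard
# error of the best model's CV error.
one.se.choice <- function(dfs, cv.mean, cv.se) {
  best <- which.min(cv.mean)
  threshold <- cv.mean[best] + cv.se[best]
  dfs[min(which(cv.mean <= threshold))]
}

# Illustrative CV means and standard errors for df = 3:7
dfs     <- 3:7
cv.mean <- c(0.00420, 0.00395, 0.00388, 0.00385, 0.00390)
cv.se   <- c(0.00030, 0.00028, 0.00027, 0.00026, 0.00029)
one.se.choice(dfs, cv.mean, cv.se)
```

With these numbers the minimum sits at df = 6, but df = 4 already falls within one standard error of it, so the rule selects the simpler model.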
College GAMs
This question relates to the College data set.
Split the data into a training and test set. Use out-of-state tuition as the response and the other variables as the predictors to perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
# Seed the random number generator before doing a 70/30 split
set.seed(163)
train.idx <- sample(nrow(College), floor(nrow(College) * 0.7), replace = FALSE)
test.idx <- (-train.idx)
college.data.train <- College[train.idx, ]
college.data.test <- College[test.idx, ]
college.data.test.mat <- model.matrix(Outstate ~., data = college.data.test)
# Perform forward stepwise feature selection
fwd.selection.fit <- regsubsets(
Outstate ~ .,
data = college.data.train,
nvmax = 17,
method = 'forward'
)
# Iterate over the best fit of each subset size and calculate the test set error
val.errors <- rep(NA, 17) # One for each variable size
for (i in 1:17) {
coef.i <- coef(fwd.selection.fit, id = i)
pred.i <- college.data.test.mat[, names(coef.i)] %*% coef.i
val.errors[i] <- mean((college.data.test$Outstate - pred.i)^2)
}
plot(val.errors, xlab = 'Subset Size', ylab = 'Test Set MSE',
main = 'College - Test Set MSE for Best Forward Selection Subset', type = 'b')
points(which.min(val.errors), min(val.errors), col = 'red', pch = 20, cex = 2)
# Which subset size gives the lowest test-set MSE?
A 13-variable model has the lowest validation set MSE. However, we observe that the validation set MSE for the 6-variable model is practically the same, so it makes sense to prefer the 6-variable model: it is simpler and less likely to overfit.
Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as predictors. Plot the results and explain your findings.
# Get the names of the features selected in the previous step
(best.subset.features <- names(coef(fwd.selection.fit, id = 6)))
## [1] "(Intercept)" "PrivateYes" "Room.Board" "PhD" "perc.alumni"
## [6] "Expend" "Grad.Rate"
college.train.data.matrix <- model.matrix(Outstate ~., data = college.data.train)
college.subset.train.data <- college.train.data.matrix[, best.subset.features]
# Train a GAM with these features: the simplest approach is to use a smoothing
# spline for every continuous feature
college.train.gam <- gam(
Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + s(Expend) +
s(Grad.Rate),
data = college.data.train
)
# Use `plot` to display the relationship between each spline and response
par(mfrow = c(2, 3))
plot(college.train.gam, se = T)
The individual smooths are least reliable at the extremes of their features' ranges, where the standard-error bands widen considerably. Nevertheless, there are clearly non-linear relationships here, which the splines do capture.
Evaluate the model obtained on the test set, and explain the results obtained.
# MSE
mean((predict(college.train.gam, newdata = college.data.test) - college.data.test$Outstate)^2)
## [1] 4104102
# Generic form of R-squared: 1 - RSS / TSS
test_TSS <- sum((college.data.test$Outstate - mean(college.data.test$Outstate))^2)
test_RSS <- sum((predict(college.train.gam, newdata = college.data.test) - college.data.test$Outstate)^2)
1 - test_RSS/test_TSS
## [1] 0.7513565
The R-squared statistic looks quite good: the model explains about 75% of the test-set variance in tuition. The MSE looks large in absolute terms, but its square root (roughly $2,026) puts the typical prediction error on a more interpretable scale.
For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(college.train.gam)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) +
## s(Expend) + s(Grad.Rate), data = college.data.train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6741.75 -1044.33 37.93 1159.33 7139.55
##
## (Dispersion Parameter for gaussian family taken to be 3253406)
##
## Null Deviance: 8614032615 on 542 degrees of freedom
## Residual Deviance: 1695024814 on 521 degrees of freedom
## AIC: 9706.91
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 1810946422 1810946422 556.631 < 2.2e-16 ***
## s(Room.Board) 1 1703398779 1703398779 523.574 < 2.2e-16 ***
## s(PhD) 1 667322951 667322951 205.115 < 2.2e-16 ***
## s(perc.alumni) 1 343712478 343712478 105.647 < 2.2e-16 ***
## s(Expend) 1 802314787 802314787 246.608 < 2.2e-16 ***
## s(Grad.Rate) 1 112120479 112120479 34.462 7.742e-09 ***
## Residuals 521 1695024814 3253406
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board) 3 3.629 0.012948 *
## s(PhD) 3 3.846 0.009642 **
## s(perc.alumni) 3 1.027 0.380413
## s(Expend) 3 36.475 < 2.2e-16 ***
## s(Grad.Rate) 3 2.797 0.039645 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA for parametric effects shows that every predictor is statistically significant in the context of a linear relationship with the response.
The ANOVA for nonparametric effects, by contrast, reports \(p\)-values against a null hypothesis of a linear effect, so a small \(p\)-value provides evidence for the alternative hypothesis of a non-linear relationship.
The smallest \(p\)-value is for Expend, giving strong evidence that expenditure has a non-linear relationship with the response. Room.Board, PhD, and Grad.Rate are also below the 5% threshold, whereas perc.alumni shows no significant evidence of non-linearity.
We will now explore backfitting in the context of multiple linear regression.
Suppose that we would like to perform multiple linear regression, but do not have software to do so. Instead, we only have software to perform simple linear regression. Therefore, we take the following approach: 1. we repeatedly hold all but one coefficient estimate fixed at its current value; 2. we update that coefficient estimate using a simple linear regression on the corresponding partial residual; 3. we continue this process until convergence, i.e. until the coefficient estimates stop changing.
We now try this out on a toy example.
Generate a response \(Y\) and two predictors \(X_1\) and \(X_2\) with \(n = 100\).
# Seed the RNG so the simulation is reproducible
set.seed(163)
n <- 100
x1 <- rnorm(n = n, mean = 16.3, sd = 1)
x2 <- rnorm(n = n, mean = 19.4, sd = 1)
eps <- rnorm(n = n, mean = 0, sd = 1)
y <- 164 + 168 * x1 + 177 * x2 + eps
Initialize \(\hat{\beta_1}\) to a value of your choice. It does not matter what value you choose.
beta.1 <- 10
Keeping \(\hat{\beta_1}\) fixed, fit the model \[ Y - \hat{\beta_1}X_1 = \beta_0 + \beta_2X_2 + \epsilon \]
# We find the residual between y and the product of our current estimate of
# beta and the corresponding predictor.
my.resid <- y - beta.1 * x1
# We then fit a linear model to this residual as a function of the second predictor
beta.2 <- lm(my.resid ~ x2)$coef[2]
Keeping \(\hat{\beta_2}\) fixed, fit the model \[ Y - \hat{\beta_2}X_2 = \beta_0 + \beta_1X_1 + \epsilon \]
# Repeat the same process here
my.resid <- y - beta.2 * x2
beta.1 <- lm(my.resid ~ x1)$coef[2]
Write a for loop to repeat the steps in (c) and (d) 1000 times. Report the estimates of \(\hat{\beta_1}\) and \(\hat{\beta_2}\) at each iteration of the for loop. Create a plot in which each of these values is displayed, with \(\hat{\beta_0}, \hat{\beta_1}, \hat{\beta_2}\) shown in a different color.
# Initialise vectors to store coefficient estimates
all.beta.0 <- c()
all.beta.1 <- c()
all.beta.2 <- c()
# Reinitialize beta.1 to the original value
beta.1 <- 10
for (i in 1:1000) {
# Fit residual as a function of beta.2
my.resid <- y - beta.1 * x1
beta.2 <- lm(my.resid ~ x2)$coefficients[2]
all.beta.2[i] <- beta.2
# Fit residual as a function of beta.1
my.resid <- y - beta.2 * x2
my.lm <- lm(my.resid ~ x1)
beta.1 <- my.lm$coefficients[2]
beta.0 <- my.lm$coefficients[1]
# update history
all.beta.1[i] <- beta.1
all.beta.0[i] <- beta.0
}
coef.df <- data.frame(iter = 1:1000,
beta.0 = all.beta.0, beta.1 = all.beta.1, beta.2 = all.beta.2)
# plot the variation in backfitted linear models
coef.df %>%
reshape2::melt(id.vars = 'iter') %>%
ggplot(aes(x = iter, y = value, color = variable, group = variable)) +
geom_line() +
labs(x = 'Iteration', y = 'Coefficient Value',
title = 'Backfitting - Variation in Coefficient Estimates by Iteration')
Compare your answer in (e) to the results of simply performing a multiple linear regression to predict \(Y\) using \(X_1\) and \(X_2\). Use the abline function to overlay those multiple linear regression coefficient estimates on the plot obtained in (e).
my.lm.fit <- lm(y ~ x1 + x2)
plot(coef(my.lm.fit), col = 'red', xlab = 'Coefficient Index', ylab = 'Estimate')
points(c(all.beta.0[1000], all.beta.1[1000], all.beta.2[1000]), col = 'blue', cex = 2)
The final converged estimates shown in blue are exactly the same as the linear model estimates shown in red.
On this data set, how many backfitting iterations were required to obtain a good approximation to the multiple regression coefficient estimates?
About 6 iterations: the coefficient traces in the plot are essentially flat from that point on.
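We can make that count precise on a small self-contained example by recording how many sweeps it takes for the backfitted coefficients to come within a tolerance of the full least-squares solution (the data below are freshly simulated, not the x1/x2 from the earlier parts):

```r
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 4 * x2 + rnorm(n)

target <- coef(lm(y ~ x1 + x2))  # full multiple-regression estimates
beta1 <- 0                       # arbitrary starting value
iters.needed <- NA

for (i in 1:50) {
  r2 <- y - beta1 * x1           # partial residual for x2
  beta2 <- coef(lm(r2 ~ x2))[2]  # update beta2 with beta1 fixed
  r1 <- y - beta2 * x2           # partial residual for x1
  fit1 <- lm(r1 ~ x1)            # update beta1 (and the intercept)
  beta1 <- coef(fit1)[2]
  beta0 <- coef(fit1)[1]
  if (max(abs(c(beta0, beta1, beta2) - target)) < 1e-6) {
    iters.needed <- i
    break
  }
}
iters.needed  # a handful of sweeps when x1 and x2 are weakly correlated
```

Convergence is geometric at a rate governed by the squared correlation between the predictors, which is why only a few sweeps are needed when they are nearly uncorrelated.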
This problem is a continuation of the previous exercise. In a toy example with \(p = 100\), show that one can approximate the multiple linear regression coefficient estimates by repeatedly performing simple linear regression in a backfitting procedure. How many backfitting iterations are required in order to obtain a “good” approximation to the multiple linear regression coefficient estimates? Create a plot to justify your answer.
set.seed(163)
# Simulate the coefficients
beta.0.true <- 10
beta.true <- rnorm(n = 100, mean = 16.3, sd = 19.4)
# Generate a features matrix with p predictors
X <- matrix(rnorm(5000 * 100, mean = 10, sd = 36), nrow = 5000, ncol = 100)
# Generate random noise
eps <- rnorm(5000, mean = 0, sd = 9000)
# Generate the response
Y <- beta.0.true + X %*% beta.true + eps
# To start off, we can easily fit a linear regressor to this data
my.lm.fit <- lm(Y ~ X)
summary(my.lm.fit)
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32167 -5830 213 5884 31668
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 453.6259 379.8106 1.194 0.232400
## X1 1.5455 3.5355 0.437 0.662030
## X2 20.0738 3.4906 5.751 9.42e-09 ***
## X3 24.7143 3.5134 7.034 2.28e-12 ***
## X4 -12.1777 3.5587 -3.422 0.000627 ***
## X5 14.3727 3.5660 4.031 5.65e-05 ***
## X6 -0.5297 3.5051 -0.151 0.879897
## X7 13.1772 3.5007 3.764 0.000169 ***
## X8 27.3131 3.5031 7.797 7.71e-15 ***
## X9 22.1980 3.4994 6.343 2.45e-10 ***
## X10 17.3560 3.5323 4.914 9.24e-07 ***
## X11 8.2951 3.4900 2.377 0.017501 *
## X12 2.1982 3.5530 0.619 0.536161
## X13 47.1126 3.5316 13.340 < 2e-16 ***
## X14 23.4010 3.5594 6.574 5.39e-11 ***
## X15 30.5206 3.4549 8.834 < 2e-16 ***
## X16 -10.7363 3.4871 -3.079 0.002090 **
## X17 -26.6194 3.4318 -7.757 1.05e-14 ***
## X18 -15.7610 3.4873 -4.519 6.35e-06 ***
## X19 -21.9605 3.5272 -6.226 5.18e-10 ***
## X20 17.3388 3.4390 5.042 4.78e-07 ***
## X21 51.1380 3.5321 14.478 < 2e-16 ***
## X22 -11.2664 3.5406 -3.182 0.001471 **
## X23 38.6281 3.5064 11.016 < 2e-16 ***
## X24 36.9557 3.4975 10.566 < 2e-16 ***
## X25 64.7999 3.4646 18.703 < 2e-16 ***
## X26 30.6328 3.5123 8.722 < 2e-16 ***
## X27 28.3919 3.5149 8.078 8.25e-16 ***
## X28 -16.8038 3.5065 -4.792 1.70e-06 ***
## X29 22.3748 3.5017 6.390 1.82e-10 ***
## X30 17.4231 3.5354 4.928 8.57e-07 ***
## X31 52.9349 3.5558 14.887 < 2e-16 ***
## X32 27.2629 3.5289 7.726 1.34e-14 ***
## X33 21.0503 3.5037 6.008 2.01e-09 ***
## X34 -5.4687 3.5103 -1.558 0.119315
## X35 -4.4043 3.4475 -1.278 0.201473
## X36 26.9807 3.5600 7.579 4.14e-14 ***
## X37 27.7897 3.5183 7.899 3.45e-15 ***
## X38 -5.8550 3.4975 -1.674 0.094182 .
## X39 5.9723 3.4640 1.724 0.084758 .
## X40 64.3628 3.4912 18.436 < 2e-16 ***
## X41 34.2668 3.5898 9.546 < 2e-16 ***
## X42 25.1816 3.4497 7.300 3.35e-13 ***
## X43 6.5551 3.4571 1.896 0.058001 .
## X44 -7.6784 3.5543 -2.160 0.030798 *
## X45 34.0683 3.5869 9.498 < 2e-16 ***
## X46 37.0388 3.4991 10.585 < 2e-16 ***
## X47 7.5719 3.4902 2.169 0.030096 *
## X48 -21.2556 3.5187 -6.041 1.65e-09 ***
## X49 -8.6081 3.4947 -2.463 0.013804 *
## X50 35.4437 3.5240 10.058 < 2e-16 ***
## X51 -5.0051 3.5321 -1.417 0.156534
## X52 35.9544 3.5082 10.249 < 2e-16 ***
## X53 12.9572 3.5045 3.697 0.000220 ***
## X54 26.5549 3.5273 7.528 6.08e-14 ***
## X55 22.2878 3.4927 6.381 1.92e-10 ***
## X56 0.7590 3.5450 0.214 0.830473
## X57 -22.7166 3.4395 -6.605 4.41e-11 ***
## X58 -1.5120 3.4852 -0.434 0.664424
## X59 -13.1792 3.4990 -3.767 0.000167 ***
## X60 19.6157 3.4895 5.621 2.00e-08 ***
## X61 5.5461 3.5148 1.578 0.114645
## X62 9.6947 3.5030 2.768 0.005669 **
## X63 10.7000 3.4271 3.122 0.001806 **
## X64 17.3983 3.4810 4.998 5.99e-07 ***
## X65 -2.4552 3.4205 -0.718 0.472921
## X66 -3.9880 3.5694 -1.117 0.263927
## X67 13.7566 3.5026 3.928 8.70e-05 ***
## X68 40.9009 3.5750 11.441 < 2e-16 ***
## X69 -20.8557 3.4971 -5.964 2.64e-09 ***
## X70 69.1847 3.5327 19.584 < 2e-16 ***
## X71 19.1413 3.5125 5.449 5.30e-08 ***
## X72 53.6097 3.5075 15.284 < 2e-16 ***
## X73 27.2000 3.5477 7.667 2.11e-14 ***
## X74 24.9228 3.5110 7.099 1.44e-12 ***
## X75 18.3462 3.5020 5.239 1.68e-07 ***
## X76 47.5273 3.5265 13.477 < 2e-16 ***
## X77 28.4356 3.4636 8.210 2.81e-16 ***
## X78 17.7151 3.4953 5.068 4.16e-07 ***
## X79 11.6622 3.4718 3.359 0.000788 ***
## X80 31.6932 3.5390 8.955 < 2e-16 ***
## X81 -2.5952 3.5670 -0.728 0.466917
## X82 6.7965 3.5215 1.930 0.053666 .
## X83 26.7149 3.4519 7.739 1.21e-14 ***
## X84 26.4704 3.4862 7.593 3.72e-14 ***
## X85 5.7491 3.5364 1.626 0.104081
## X86 24.1989 3.5084 6.897 5.97e-12 ***
## X87 25.0596 3.5120 7.135 1.11e-12 ***
## X88 -6.2818 3.4813 -1.804 0.071228 .
## X89 11.9067 3.5330 3.370 0.000757 ***
## X90 24.2462 3.4524 7.023 2.47e-12 ***
## X91 -19.9463 3.4896 -5.716 1.16e-08 ***
## X92 2.4798 3.5203 0.704 0.481190
## X93 -18.3690 3.4896 -5.264 1.47e-07 ***
## X94 30.9267 3.4921 8.856 < 2e-16 ***
## X95 1.0069 3.5262 0.286 0.775225
## X96 -13.4560 3.5370 -3.804 0.000144 ***
## X97 4.9635 3.4902 1.422 0.155052
## X98 -5.1781 3.4914 -1.483 0.138108
## X99 2.9421 3.5123 0.838 0.402266
## X100 36.7038 3.4738 10.566 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8848 on 4899 degrees of freedom
## Multiple R-squared: 0.5176, Adjusted R-squared: 0.5078
## F-statistic: 52.57 on 100 and 4899 DF, p-value: < 2.2e-16
For backfitting, initialize all the coefficients except the first randomly.
beta.0.temp <- NA
beta.all.temp <- rep(NA, 100)
set.seed(163)
# Leave beta.all.temp[1] unset: the first sweep of the inner loop fills it in
beta.all.temp[2:100] <- rnorm(99)
We need to implement a nested for loop. The outer loop controls the number of backfitting iterations; within each backfitting iteration, the inner loop cycles over the predictors, computing partial residuals and fitting simple linear regressions to them.
# Initialize a matrix to store estimates for each predictor at each iteration
beta.matrix <- matrix(nrow = 1 + 50, ncol = 1 + 100)
# First row is initialization
beta.matrix[1, 3:101] <- beta.all.temp[2:100]
for (i in 1:50) {
for (p in 1:100) {
# Partial residual: remove the contribution of all the other predictors
residual.p <- Y - X[, -p] %*% beta.all.temp[-p]
fit.p <- lm(residual.p ~ X[, p])
beta.all.temp[p] <- fit.p$coefficients[2]
}
# Take the intercept estimate from the final simple regression of this sweep
beta.0.temp <- fit.p$coefficients[1]
# Update the matrix
beta.matrix[i + 1, ] <- c(beta.0.temp, beta.all.temp)
}
We can drop the first row of beta.matrix: it holds the randomly initialized values of \(\hat{\beta}_j\), \(j \in \{2, \dots, p\}\), that were needed to start the first iteration of the backfitting process.
To test whether the procedure converged, we can track the distance between the coefficient estimates at each iteration and the full least-squares estimates.
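In symbols, the quantity tracked at iteration \(i\) is the Euclidean distance between the backfitted coefficient vector and the full least-squares solution, \[ \text{err}^{(i)} = \left\lVert \hat{\beta}^{(i)} - \hat{\beta}^{\,\mathrm{OLS}} \right\rVert_2 = \sqrt{\sum_{j=0}^{p} \left( \hat{\beta}^{(i)}_j - \hat{\beta}^{\,\mathrm{OLS}}_j \right)^2}, \] which should shrink towards zero as backfitting converges.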
param.err <- c()
# For each backfitting iteration (skipping row 1, the random initialization)
for (i in 1:50) {
param.err.i <- sqrt(sum((beta.matrix[i + 1, ] - my.lm.fit$coefficients)^2))
param.err[i] <- param.err.i
}
my.err.df <- data.frame(iteration = 1:50, param_err = param.err)
my.err.df %>% ggplot(aes(x = iteration, y = param_err)) +
geom_point() +
geom_line() +
labs(x = 'Iteration', y = 'RMSE(Coeff Values)',
title = 'Backfitting on Simulated Data - Parameter Error Summary by Iteration')
It looks like the parameter error collapses after the very first backfitting sweep. We zoom in on the first few iterations to confirm.
my.err.df %>%
dplyr::filter(iteration < 5) %>%
ggplot(aes(x = iteration, y = param_err)) +
geom_point() +
geom_line() +
labs(x = 'Iteration', y = 'RMSE(Coeff Values)',
title = 'Backfitting on Simulated Data - Parameter Error Summary by Iteration')